{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e0dd4a63-5f87-43e9-a67f-ca9e00b48ede",
   "metadata": {},
   "source": [
    "<!-- Banner Image -->\n",
    "<img src=\"https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdevnotebooks.png\" width=\"100%\">\n",
    "\n",
    "<!-- Links -->\n",
    "<center>\n",
    "  <a href=\"https://brev.nvidia.com\" style=\"color: #06b6d4;\">Console</a> •\n",
    "  <a href=\"https://developer.nvidia.com/brev\" style=\"color: #06b6d4;\">Docs</a> •\n",
    "  <a href=\"/\" style=\"color: #06b6d4;\">Templates</a> •\n",
    "  <a href=\"https://discord.gg/NVDyv7TUgJ\" style=\"color: #06b6d4;\">Discord</a>\n",
    "</center>\n",
    "\n",
    "# Finetune and deploy the new Llama3-8b model using SFT and VLLM 🤙\n",
    "\n",
    "Welcome!\n",
    "\n",
    "It's April 18th and Llama3 just dropped! So let's figure out how to finetune it and deploy it so you can start using it :). In order to get access you'll need to request via [HuggingFace](https://huggingface.co/meta-llama/Meta-Llama-3-8B). It took me about 20 minutes to get access to so ahead and sign up to get your token enabled!\n",
    "\n",
    "Meta released a couple different models today. \n",
    "1. Llama-3-8B\n",
    "2. Llama-3-8b-instruct\n",
    "3. Llama-3-70b\n",
    "4. Llama-3-70b-instruct\n",
    "\n",
    "Llama-3 was has an 8k context length which is pretty small compared to some of the newer models that have been released and was trained with trained with 15 trillion tokens on a 24k GPU cluster. Luckily for finetuning, we only need a fraction of that compute power.\n",
    "\n",
    "In this notebook, we're going to walk through the flow of finetuning our model from scratch using the base model and then deploy it using VLLM. We'll be releasing a guide soon that uses Direct Preference Optimization as a way of finetuning soon!\n",
    "\n",
    "**Note that we will be using the base model in this notebook and not the instruct model. Additionally, we will be running through a full finetune with no quantization. This notebook requires 2 A100-80GB GPUs which might cost a bit more. If you're looking for a lighter Llama3-finetune, checkout out our Llama3 finetune that uses Direct Preference Optimization [here](https://github.com/brevdev/notebooks/blob/main/llama3-dpo.ipynb).**\n",
    "\n",
    "#### Help us make this tutorial better! Please provide feedback on the [Discord channel](https://discord.gg/T9bUNqMS8d) or on [X](https://x.com/brevdev).\n",
    "\n",
    "A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running. A number means it has completed. If your Notebook is acting weird, you can interrupt a too-long process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or even restarting the kernel (Kernel tab -> Restart Kernel). Note restarting the kernel will require you to run everything from the beginning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01a7bfcf-0552-433f-8afb-145e661ab34a",
   "metadata": {},
   "source": [
    "## Table of Contents\n",
    "1. Install dependancies\n",
    "2. Download model\n",
    "3. Fintuning flow\n",
    "4. Deploy as an OpenAI compatible endpoint"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2fab017-7a3d-4322-8ec7-bc20eabe7e9a",
   "metadata": {},
   "source": [
    "## 1. Install Dependancies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bace79c2-2f05-4c2b-bbea-db212e93839d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39611709-247b-4a18-9962-60f9ce167926",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q -U bitsandbytes\n",
    "!pip install -q -U git+https://github.com/huggingface/transformers.git\n",
    "!pip install -q -U git+https://github.com/huggingface/peft.git\n",
    "!pip install -q -U git+https://github.com/huggingface/accelerate.git\n",
    "!pip install trl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2be21a74-407f-49a9-83e9-771cf90d29b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import torch\n",
    "from datasets import load_dataset\n",
    "from transformers import (\n",
    "    AutoModelForCausalLM,\n",
    "    AutoTokenizer,\n",
    "    BitsAndBytesConfig,\n",
    "    HfArgumentParser,\n",
    "    TrainingArguments,\n",
    "    pipeline,\n",
    "    logging,\n",
    ")\n",
    "from peft import LoraConfig, PeftModel\n",
    "from trl import SFTTrainer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67588314-d055-4403-b8d3-0f9662a3148e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Input access token\n",
    "from huggingface_hub import notebook_login\n",
    "\n",
    "notebook_login()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3fc909e-efb6-4b4b-8c10-e0ed72477600",
   "metadata": {},
   "source": [
    "## 2. Load in Llama 3 and our dataset\n",
    "\n",
    "Because we are using the base model, there is not an exact prompt template we have to follow. The dataset we are using follows LLama3's template format so it should be fine for downstream tasks that use the Llama3 chat format. If you're bringing your own data, you can format it however you want as long as you use the same formatting downstream. \n",
    "\n",
    "Here's the official [Llama3 chat template](https://huggingface.co/blog/llama3#how-to-prompt-llama-3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b422e214-977a-4c45-9d66-b6a698c546eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "base_model_id = \"meta-llama/Meta-Llama-3-8B\"\n",
    "dataset_name = \"scooterman/guanaco-llama3-1k\"\n",
    "new_model = \"brev-llama3-8b-SFT\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3a03b3f-5398-455e-880c-f503cba51821",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "dataset = load_dataset(dataset_name, split=\"train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5081c400-f9c0-403b-b145-0e75e5a9982e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(base_model_id, device_map=\"auto\")\n",
    "tokenizer = AutoTokenizer.from_pretrained(\n",
    "    base_model_id,\n",
    "    add_eos_token=True,\n",
    "    add_bos_token=True, \n",
    ")\n",
    "tokenizer.pad_token = tokenizer.eos_token"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77919e07-7495-4a50-82c7-99459ba815c2",
   "metadata": {},
   "source": [
    "## 3. Set our Training Arguments\n",
    "\n",
    "A lot of tutorials simply paste a list of arguments leaving it up to the reader to figure out what each argument does. Below I've added comments which explain what each argument does!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "614e0a54-2740-438d-a72c-f3c0215c6558",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Output directory where the results and checkpoint are stored\n",
    "output_dir = \"./results\"\n",
    "\n",
    "# Number of training epochs - how many times does the model see the whole dataset\n",
    "num_train_epochs = 1 #Increase this for a larger finetune\n",
    "\n",
    "# Enable fp16/bf16 training. This is the type of each weight. Since we are on an A100\n",
    "# we can set bf16 to true because it can handle that type of computation\n",
    "bf16 = True\n",
    "\n",
    "# Batch size is the number of training examples used to train a single forward and backward pass. \n",
    "per_device_train_batch_size = 4\n",
    "\n",
    "# Gradients are accumulated over multiple mini-batches before updating the model weights. \n",
    "# This allows for effectively training with a larger batch size on hardware with limited memory\n",
    "gradient_accumulation_steps = 2\n",
    "\n",
    "# memory optimization technique that reduces RAM usage during training by intermittently storing \n",
    "# intermediate activations instead of retaining them throughout the entire forward pass, trading \n",
    "# computational time for lower memory consumption.\n",
    "gradient_checkpointing = True\n",
    "\n",
    "# Maximum gradient normal (gradient clipping)\n",
    "max_grad_norm = 0.3\n",
    "\n",
    "# Initial learning rate (AdamW optimizer)\n",
    "learning_rate = 2e-4\n",
    "\n",
    "# Weight decay to apply to all layers except bias/LayerNorm weights\n",
    "weight_decay = 0.001\n",
    "\n",
    "# Optimizer to use\n",
    "optim = \"paged_adamw_32bit\"\n",
    "\n",
    "# Number of training steps (overrides num_train_epochs)\n",
    "max_steps = 5\n",
    "\n",
    "# Ratio of steps for a linear warmup (from 0 to learning rate)\n",
    "warmup_ratio = 0.03\n",
    "\n",
    "# Group sequences into batches with same length\n",
    "# Saves memory and speeds up training considerably\n",
    "group_by_length = True\n",
    "\n",
    "# Save checkpoint every X updates steps\n",
    "save_steps = 100\n",
    "\n",
    "# Log every X updates steps\n",
    "logging_steps = 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ba492aa-ad13-46d4-be7b-923240b6a133",
   "metadata": {},
   "outputs": [],
   "source": [
    "training_arguments = TrainingArguments(\n",
    "    output_dir=output_dir,\n",
    "    num_train_epochs=num_train_epochs,\n",
    "    per_device_train_batch_size=per_device_train_batch_size,\n",
    "    gradient_accumulation_steps=gradient_accumulation_steps,\n",
    "    optim=optim,\n",
    "    save_steps=save_steps,\n",
    "    logging_steps=logging_steps,\n",
    "    learning_rate=learning_rate,\n",
    "    weight_decay=weight_decay,\n",
    "    bf16=bf16,\n",
    "    max_grad_norm=max_grad_norm,\n",
    "    max_steps=max_steps,\n",
    "    warmup_ratio=warmup_ratio,\n",
    "    group_by_length=group_by_length,\n",
    "    report_to=\"wandb\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7bec1455-c08d-4671-bb3c-48cf07ec86a8",
   "metadata": {},
   "source": [
    "## Run our training job using WandB for logging\n",
    "\n",
    "Weights and Biases is industry standard for monitoring and evaluating your training job. I highly suggest setting up an account to monitor this run and use it for future ML jobs!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1250d948-e191-45a4-bd57-1fd9226666a0",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install wandb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f57e00a2-a90a-4e43-a79d-2fe98dda58f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import wandb\n",
    "\n",
    "wandb.login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c93b0c17-77af-4328-889a-d418d0814101",
   "metadata": {},
   "outputs": [],
   "source": [
    "training_arguments = TrainingArguments(\n",
    "    output_dir=output_dir,\n",
    "    num_train_epochs=num_train_epochs,\n",
    "    per_device_train_batch_size=per_device_train_batch_size,\n",
    "    gradient_accumulation_steps=gradient_accumulation_steps,\n",
    "    optim=optim,\n",
    "    save_steps=save_steps,\n",
    "    logging_steps=logging_steps,\n",
    "    learning_rate=learning_rate,\n",
    "    weight_decay=weight_decay,\n",
    "    bf16=bf16,\n",
    "    max_grad_norm=max_grad_norm,\n",
    "    max_steps=max_steps,\n",
    "    warmup_ratio=warmup_ratio,\n",
    "    group_by_length=group_by_length,\n",
    "    report_to=\"wandb\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c6ad908-f10e-4605-bff3-efdfbe722a5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer = SFTTrainer(\n",
    "    model=model,\n",
    "    train_dataset=dataset,\n",
    "    dataset_text_field=\"text\",\n",
    "    tokenizer=tokenizer,\n",
    "    args=training_arguments,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ef576ca-fa8d-4045-8a75-3e74116d4abf",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.train()\n",
    "\n",
    "# Save trained model\n",
    "trainer.model.save_pretrained(new_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11b2e768-2b8c-4028-a385-37304eead298",
   "metadata": {},
   "source": [
    "## Deploy for inference!\n",
    "\n",
    "To deploy this model for extremely quick inference, we use VLLM and host an OpenAI compatible endpoint. You might have to restart the kernel again and then just run the cells below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f8def9f-53c8-47ab-ae79-068b0a282017",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "668ca2b3-c0c4-4ca8-8331-7c6612f2dcd0",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip install vllm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4862b83e-ec92-43bc-b7f6-d7aa7712430f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!python -O -u -m vllm.entrypoints.openai.api_server \\\n",
    "    --host=127.0.0.1 \\\n",
    "    --port=8000 \\\n",
    "    --model=brev-llama3-8b-SFT \\\n",
    "    --tokenizer=meta-llama/Meta-Llama-3-8B \\\n",
    "    --tensor-parallel-size=2\n",
    "\n",
    "\n",
    "\n",
    "# Open up a terminal and run\n",
    "#curl http://localhost:8000/v1/completions \\\n",
    "#    -H \"Content-Type: application/json\" \\\n",
    "#    -d '{\n",
    "#        \"model\": \"brev-llama3-8b-SFT\",\n",
    "#        \"prompt\": \"What is San Francisco\",\n",
    "#        \"max_tokens\": 30,\n",
    "#        \"temperature\": 0\n",
    "#    }'"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
